Improve PDF reader and SaveAsPdf compatibility by PrzemyslawKlys · Pull Request #1755 · EvotecIT/OfficeIMO

PrzemyslawKlys · 2026-04-08T10:49:31Z

Summary

improve PDF writer and Word-to-PDF robustness around footer rendering, output path validation, QuestPDF license restoration, and custom font registration retries
expand PDF reader and lightweight extractor compatibility for inherited page metadata/resources, nested forms, content arrays, compressed and filtered streams, inline dictionaries, escaped names, comments, string decoding, and predictor-based DecodeParms
add broad regression coverage for PDF reader/extractor and SaveAsPdf edge cases

Root cause

The PDF projects handled the library's own output well, but several external-PDF shapes and a few SaveAsPdf failure paths still relied on narrow assumptions around page inheritance, stream filters, string parsing, and cleanup behavior.

Validation

dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --filter "FullyQualifiedName~PdfReaderAndFooterRegressionTests"
dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --filter "FullyQualifiedName~Pdf"

Regression results were green on net8.0, net10.0, and net472.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c4c7387578

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-08T10:56:18Z

OfficeIMO.Pdf/Reading/Filters/StreamDecoder.cs

+        if (decodeParmsObj is PdfArray decodeParmsArray &&
+            filterIndex >= 0 &&
+            filterIndex < decodeParmsArray.Items.Count &&
+            ResolveDictionary(decodeParmsArray.Items[filterIndex], objects) is PdfDictionary indexedDict) {


Resolve DecodeParms refs before indexing filter params

GetDecodeParms only handles DecodeParms when it is a direct dictionary or a direct array. If a PDF uses an indirect array (for example /DecodeParms 12 0 R where object 12 is [null << /Predictor ... >>]), this branch is skipped, so predictor settings are never applied and Flate-decoded content remains PNG-predicted garbage. That causes text extraction to fail on otherwise valid filtered streams that store decode params indirectly.
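
A minimal sketch of the fix suggested above, assuming a simplified object model: the types `PdfObj`, `PdfRef`, `PdfArr`, `PdfDict`, `PdfNum`, and `PdfNull` and the helper `Resolve` are hypothetical stand-ins, not the OfficeIMO types. The idea is to resolve `/DecodeParms` itself before checking whether it is a dictionary or an array, and to resolve each array element as well. Treating a single dictionary as applying only to filter index 0 is also an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the reader's object model (requires C# 9 records).
abstract record PdfObj;
sealed record PdfNull : PdfObj;
sealed record PdfNum(int Value) : PdfObj;
sealed record PdfRef(int ObjectNumber) : PdfObj;
sealed record PdfArr(IReadOnlyList<PdfObj> Items) : PdfObj;
sealed record PdfDict(IReadOnlyDictionary<string, PdfObj> Entries) : PdfObj;

static class DecodeParmsSketch {
    // Follow indirect references until a direct object is reached.
    // The hop limit keeps a reference loop from spinning forever.
    public static PdfObj? Resolve(PdfObj? obj, IReadOnlyDictionary<int, PdfObj> objects) {
        for (int hops = 0; obj is PdfRef r && hops < 32; hops++) {
            if (!objects.TryGetValue(r.ObjectNumber, out obj)) return null;
        }
        return obj is PdfRef ? null : obj;
    }

    public static PdfDict? GetDecodeParms(PdfObj? decodeParms, int filterIndex,
                                          IReadOnlyDictionary<int, PdfObj> objects) {
        // Resolve first, so "/DecodeParms 12 0 R" behaves like the direct form.
        var resolved = Resolve(decodeParms, objects);
        if (resolved is PdfDict single) return filterIndex == 0 ? single : null;
        if (resolved is PdfArr arr && filterIndex >= 0 && filterIndex < arr.Items.Count) {
            // Array elements may be indirect too; null entries yield no parameters.
            return Resolve(arr.Items[filterIndex], objects) as PdfDict;
        }
        return null;
    }
}
```

With object 12 holding `[null << /Predictor 12 >>]`, a lookup for filter index 1 through the reference then returns the predictor dictionary, while index 0 returns nothing.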

codecov · 2026-04-08T10:58:32Z

Codecov Report

❌ Patch coverage is 60.81193% with 473 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.94%. Comparing base (e6b0074) to head (c4c7387).
⚠️ Report is 17 commits behind head on master.

Files with missing lines	Patch %	Lines
OfficeIMO.Pdf/Reading/PdfTextExtractor.cs	52.07%	208 Missing and 57 partials ⚠️
...ficeIMO.Pdf/Reading/Filters/PngPredictorDecoder.cs	40.74%	24 Missing and 8 partials ⚠️
OfficeIMO.Pdf/Reading/Core/TextContentParser.cs	73.27%	17 Missing and 14 partials ⚠️
OfficeIMO.Pdf/Reading/Core/PdfReadPage.cs	80.14%	17 Missing and 11 partials ⚠️
OfficeIMO.Pdf/Reading/Core/PdfSyntax.cs	68.53%	19 Missing and 9 partials ⚠️
OfficeIMO.Pdf/Reading/Core/PdfReadDocument.cs	41.02%	12 Missing and 11 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/StreamDecoder.cs	75.43%	5 Missing and 9 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/Ascii85Decoder.cs	66.66%	9 Missing and 4 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/RunLengthDecoder.cs	50.00%	7 Missing and 4 partials ⚠️
OfficeIMO.Pdf/Reading/Filters/AsciiHexDecoder.cs	59.09%	4 Missing and 5 partials ⚠️
... and 5 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1755      +/-   ##
==========================================
+ Coverage   69.68%   69.94%   +0.25%     
==========================================
  Files         983      993      +10     
  Lines      115229   117591    +2362     
  Branches    23598    24288     +690     
==========================================
+ Hits        80300    82246    +1946     
- Misses      24839    24962     +123     
- Partials    10090    10383     +293

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 88fdde728a

chatgpt-codex-connector · 2026-04-08T12:17:00Z

OfficeIMO.Pdf/Reading/PdfTextExtractor.cs

+        foreach (var kid in kids.Items) {
+            var child = ResolveDict(kid, objects);
+            if (child is not null) {
+                TraversePagesNode(child, objects, result, visited);
+            }


Guard page-tree traversal against cyclic Kids references

TraversePagesNode recursively descends through every /Kids entry but the visited set is only used for leaf /Page objects, not intermediate /Pages nodes. If a malformed or malicious PDF has a /Kids cycle (child points back to an ancestor), ExtractAllText will recurse indefinitely and can stack-overflow the process. Add a visited check (or depth guard) for each traversed node before recursing into children.
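
One way to read the suggested visited check, as a sketch only: `PageNode` and `PageTreeSketch` are hypothetical simplifications (the real traversal works on PDF dictionaries and an object table). Every node is marked before its children are visited, so a `/Kids` entry that points back to an ancestor is skipped instead of recursed into.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simplified page-tree node, identified by its object number.
sealed class PageNode {
    public int ObjectNumber { get; init; }
    public bool IsLeafPage { get; init; }
    public List<PageNode> Kids { get; } = new();
}

static class PageTreeSketch {
    public static List<PageNode> CollectPages(PageNode root) {
        var result = new List<PageNode>();
        Traverse(root, result, new HashSet<int>());
        return result;
    }

    static void Traverse(PageNode node, List<PageNode> result, HashSet<int> visited) {
        // Mark intermediate /Pages nodes as well as leaf /Page nodes.
        // A repeated object number means a cycle (or a node listed twice), so stop here.
        if (!visited.Add(node.ObjectNumber)) return;
        if (node.IsLeafPage) {
            result.Add(node);
            return;
        }
        foreach (var kid in node.Kids) {
            Traverse(kid, result, visited);
        }
    }
}
```

A visited set bounds the work by the number of distinct objects. A separate depth limit may still be worth adding for very deep but acyclic trees.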

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f93de88543

chatgpt-codex-connector · 2026-04-08T14:53:09Z

OfficeIMO.Pdf/Reading/Core/PdfSyntax.cs

-        return idx >= 0 ? idx + 6 : -1;
+        int searchFrom = start;
+        while (searchFrom >= 0 && searchFrom < text.Length) {
+            int streamIdx = IndexOfKeyword(text, "stream", searchFrom, text.Length);


Limit stream detection to real stream tokens

FindObjectEnd now treats the first occurrence of the substring stream as a stream start even inside ordinary object contents (for example, a metadata string like (mainstream)), then skips ahead to the next endstream in later objects. That can make object boundaries jump across objects and cause dictionaries to be misparsed as streams, corrupting the object map and breaking metadata/page parsing on valid PDFs containing the word "stream" in non-stream objects.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a30e38a337

chatgpt-codex-connector · 2026-04-08T19:01:18Z

OfficeIMO.Pdf/Reading/Core/PdfSyntax.cs

+        }
+
+        char c = text[idx];
+        return char.IsWhiteSpace(c) || c is '/' or '<' or '>' or '[' or ']' or '(' or ')' or '%';


Exclude name/string delimiters from stream keyword boundaries

FindObjectEnd relies on IndexOfKeyword("stream", ...), but HasKeywordBoundary currently treats /, (, and ) as valid token boundaries, so text like /stream (name value) or (stream) (literal string) inside a non-stream object is still recognized as a real stream marker. In that case the parser jumps ahead looking for endstream, which can shift object boundaries across later objects and corrupt the parsed object map for otherwise valid PDFs containing the standalone word stream in metadata or names.
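
A hedged sketch of a stricter scan, not the code in `PdfSyntax.cs`: instead of widening or narrowing the boundary character set, it skips literal strings, names, and comments while scanning, and accepts `stream` only when the keyword is followed by LF or CR LF, which is the end-of-line marker the PDF specification requires after the keyword. `FindStreamKeyword` and its helpers are hypothetical names.

```csharp
using System;

static class StreamKeywordSketch {
    // Returns the index of a real "stream" keyword at or after start, or -1.
    public static int FindStreamKeyword(string text, int start) {
        int depth = 0; // nesting depth inside a literal string "( ... )"
        for (int i = start; i < text.Length; i++) {
            char c = text[i];
            if (depth > 0) {
                if (c == '\\') i++;          // skip the escaped character
                else if (c == '(') depth++;
                else if (c == ')') depth--;
                continue;
            }
            if (c == '(') { depth = 1; continue; }
            if (c == '%') {                  // comment: runs to end of line
                while (i < text.Length && text[i] != '\n' && text[i] != '\r') i++;
                continue;
            }
            if (c == '/') {                  // name: runs to next delimiter or whitespace
                i++;
                while (i < text.Length && !IsDelimiterOrSpace(text[i])) i++;
                i--;                         // let the loop re-examine the delimiter
                continue;
            }
            if (c == 's'
                && string.CompareOrdinal(text, i, "stream", 0, 6) == 0
                && (i == 0 || IsDelimiterOrSpace(text[i - 1]))
                && FollowedByEol(text, i + 6)) {
                return i;
            }
        }
        return -1;
    }

    static bool FollowedByEol(string text, int idx) =>
        idx < text.Length
        && (text[idx] == '\n'
            || (text[idx] == '\r' && idx + 1 < text.Length && text[idx + 1] == '\n'));

    static bool IsDelimiterOrSpace(char c) =>
        c is '\0' or '\t' or '\n' or '\f' or '\r' or ' '
          or '(' or ')' or '<' or '>' or '[' or ']' or '{' or '}' or '/' or '%';
}
```

Under these rules `(mainstream)`, `(stream)`, and `/stream` inside an ordinary dictionary are never reported. The `stream` inside `endstream` is rejected too, because the preceding `d` is not a token boundary.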

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a8123a966

chatgpt-codex-connector · 2026-04-08T19:53:19Z

OfficeIMO.Pdf/Reading/Core/PdfReadDocument.cs

+            }
+            return $"ref:{r.ObjectNumber}";
+        }
+        if (c is PdfStream) return "stream:direct";


Preserve unique keys for direct page content streams

ContentsKey now returns the constant "stream:direct" for every page whose /Contents is a direct stream, but TraversePagesNodeDeepLimited uses this key for de-duplication via contentKeys.Add(key). That means valid PDFs with multiple pages using direct (non-referenced) content streams will keep only the first page and silently drop the rest from Pages. Use a per-page unique key (or disable content-key dedupe for direct streams) so distinct pages are not collapsed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7d9765d51d

chatgpt-codex-connector · 2026-04-08T20:10:22Z

OfficeIMO.Pdf/Reading/Core/PdfReadDocument.cs

+            if (_objects.TryGetValue(r.ObjectNumber, out var ind) && ind.Value is PdfArray referencedArray) {
+                return ContentsArrayKey(referencedArray);
+            }


Preserve referenced content-array identity in page dedupe

When /Contents is an indirect reference to a PdfArray, this code now converts the key from ref:<objectNumber> to ContentsArrayKey(...), and TraversePagesNodeDeepLimited uses that key to drop “duplicates.” In valid PDFs, different page objects can point to different content-array objects that contain the same stream references (or reordered wrappers), and this change causes later pages to be silently skipped, reducing Pages and extracted text. Keep the indirect object identity in the dedupe key (or stop content-based dedupe) so distinct pages are not collapsed.
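
Both dedupe comments point the same way, which can be sketched as follows. This is an illustration under assumptions: the record types and the `pageObjectNumber` parameter are hypothetical, and the real `ContentsKey` signature may differ. An indirect `/Contents` keeps its object number as the key even when it resolves to an array, and a direct `/Contents` gets a key scoped to its owning page so that two pages never collide.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the reader's object model (requires C# 9 records).
abstract record PdfObj;
sealed record PdfRef(int ObjectNumber) : PdfObj;
sealed record PdfStream(byte[] Data) : PdfObj;
sealed record PdfArr(IReadOnlyList<PdfObj> Items) : PdfObj;

static class ContentsKeySketch {
    public static string ContentsKey(PdfObj? contents, int pageObjectNumber) => contents switch {
        // Indirect reference: key on the referenced object, not on what it contains.
        PdfRef r => $"ref:{r.ObjectNumber}",
        // Direct stream or array: it belongs to exactly one page, so key on that page.
        PdfStream or PdfArr => $"direct:page:{pageObjectNumber}",
        // Missing or unexpected /Contents: still unique per page.
        _ => $"none:page:{pageObjectNumber}"
    };
}
```

Whether content-based dedupe is needed at all is a separate question, since distinct pages in a valid PDF may legitimately share one content stream.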

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2fc7fa534f

chatgpt-codex-connector · 2026-04-08T20:33:14Z

OfficeIMO.Pdf/Reading/Core/PdfReadPage.cs

+            bool trackRecursion = formObjectNumber > 0;
+            if (trackRecursion && !activeForms.Add(formObjectNumber)) {
+                continue;


Guard direct form XObjects against recursive Do cycles

The recursion guard only tracks forms when formObjectNumber > 0, so direct /XObject form streams (returned with objectNumber = 0) are never added to activeForms. A PDF that uses direct form dictionaries with self-reference or a direct-form cycle will recurse indefinitely in CollectTextAndForms, eventually stack-overflowing during GetTextSpans(). Please key recursion tracking for direct streams too (for example by object identity or resource name path), not only indirect object numbers.
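
The object-identity option mentioned above could look roughly like this. It is a sketch with a hypothetical `FormXObject` model rather than the reader's own types, and it uses `ReferenceEqualityComparer`, which is available on .NET 5 and later (a net472 build would need its own comparer). Forms are tracked by reference while they are on the current call path, so direct forms with object number 0 are guarded exactly like indirect ones.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical simplified form XObject: its own text plus the forms it invokes via "Do".
sealed class FormXObject {
    public int ObjectNumber { get; init; }          // 0 for direct (non-referenced) forms
    public string Text { get; init; } = "";
    public List<FormXObject> Invoked { get; } = new();
}

static class FormRecursionSketch {
    public static string ExtractText(FormXObject form) {
        // Track forms by reference, so the guard does not depend on object numbers.
        var active = new HashSet<FormXObject>(ReferenceEqualityComparer.Instance);
        var sb = new StringBuilder();
        Visit(form, active, sb);
        return sb.ToString();
    }

    static void Visit(FormXObject form, HashSet<FormXObject> active, StringBuilder sb) {
        // Already on the current call path: this is a cycle, so skip it.
        if (!active.Add(form)) return;
        try {
            sb.Append(form.Text);
            foreach (var child in form.Invoked) {
                Visit(child, active, sb);
            }
        } finally {
            // Remove on the way out so a form reused elsewhere on the page is still rendered.
            active.Remove(form);
        }
    }
}
```

Removing the form in `finally` keeps the set limited to the active path, which means a form drawn twice from different places is extracted twice, while a form that reaches itself is cut off.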

chatgpt-codex-connector · 2026-04-08T20:33:14Z

OfficeIMO.Pdf/Reading/PdfTextExtractor.cs

+        bool trackRecursion = formObjectNumber > 0;
+        if (trackRecursion && !activeForms.Add(formObjectNumber)) {
+            return string.Empty;


Prevent infinite recursion for direct forms in text extractor

The lightweight extractor has the same guard gap: it only adds forms to activeForms when an indirect object number exists, but TryGetFormStream sets direct form streams to objectNumber = 0. If a page/form resources tree contains direct form XObjects that invoke themselves (or each other), ExtractTextFromContentStream will recurse without termination and can crash with stack overflow. Track direct-form recursion as well, not just indirect references.

chatgpt-codex-connector · 2026-04-08T20:56:12Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

chatgpt-codex-connector · 2026-04-08T21:05:33Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

Improve PDF reader and SaveAsPdf compatibility

c4c7387

chatgpt-codex-connector bot reviewed Apr 8, 2026

PrzemyslawKlys added 5 commits April 8, 2026 13:55

Add TIFF predictor support for PDF streams

04ddbd3

Resolve indirect PDF filter metadata

77b1384

Resolve indirect PDF stream lengths

68f002f

Parse PDF booleans and null objects

0b7eb1c

Handle endobj literals inside PDF streams

88fdde7

chatgpt-codex-connector bot reviewed Apr 8, 2026

Fix PDF review regressions and forked Claude CI

f93de88

chatgpt-codex-connector bot reviewed Apr 8, 2026

Fix PDF parser token boundaries and font alias retries

a30e38a

chatgpt-codex-connector bot reviewed Apr 8, 2026

Fix PDF parser keywords and custom font fallback

2a8123a

chatgpt-codex-connector bot reviewed Apr 8, 2026

Fix Linux custom font retries in SaveAsPdf

7d9765d

chatgpt-codex-connector bot reviewed Apr 8, 2026

Fix PDF page dedupe and Linux font retries

2fc7fa5

chatgpt-codex-connector bot reviewed Apr 8, 2026

PrzemyslawKlys added 2 commits April 8, 2026 22:44

Guard direct PDF form recursion

ab54807

Guard Skia font fallback in PDF rendering

0861b57

Fallback custom PDF font families from file paths

7f19981

PrzemyslawKlys merged commit d47a303 into EvotecIT:master Apr 9, 2026
9 checks passed

PrzemyslawKlys deleted the codex/pdf-review-worktree branch April 9, 2026 06:10
